colors
Red 43
Orange 211
Green 341
Blue 1389
image:geeksforgeeks
Dispersion is the amount of spread or variability among raw scores in a distribution.
image:geeksforgeeks
Kurtosis is the ‘sharpness’ of the peak of a frequency-distribution curve.
The ‘sharpness’ of the peak reflects the tailedness of the distribution (i.e., how often outliers occur).
image:Dynamics
Leptokurtosis
Mesokurtosis
Platykurtosis
image:Bogleheads
The Variation Ratio
A measure of dispersion.
The proportion of cases which are not in the mode category.
Unlike the range, variance, and standard deviation, it can be used with categorical (nominal) variables.
\[v = 1 - \frac{fm}{n}\]
\(fm\) = the frequency (number of cases) of the mode
\(n\) = sample size
We asked 1,984 individuals at the University what their favorite color was. We were left with four colors: red, orange, green, and blue. What is the variation ratio?
colors
Red 43
Orange 211
Green 341
Blue 1389
\[v = 1 - \frac{fm}{n}\]
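The favorite-color example can be worked through in a few lines. This is a minimal sketch in Python; the variable names (`counts`, `mode_color`, `fm`, `n`, `v`) are chosen here for illustration and only mirror the symbols in the formula.

```python
# Favorite-color counts from the example.
counts = {"Red": 43, "Orange": 211, "Green": 341, "Blue": 1389}

n = sum(counts.values())                   # sample size
mode_color = max(counts, key=counts.get)   # category with the highest frequency
fm = counts[mode_color]                    # frequency of the mode

v = 1 - fm / n                             # variation ratio
print(mode_color, n, round(v, 3))
```

Blue is the mode with 1,389 of 1,984 cases, so roughly 30% of respondents fall outside the mode category.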
Dispersion
Kurtosis
Leptokurtosis
Mesokurtosis
Platykurtosis
The Variation Ratio
Range
Variance
Standard Deviation
A measure of the span of data.
A high range value indicates more dispersion; the lower the range, the less the dispersion.
\[\text{Range} = \text{Maximum Value} - \text{Minimum Value}\]
Below is a data frame that contains the average salary within each district. Find the range.
Salary
District1 87000
District2 91000
District3 66500
District4 98500
District5 96500
District6 97550
District7 97900
District8 97990
\[?\]
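One way to check the answer is to compute it directly. This is a minimal sketch in Python; the dictionary name `salaries` and the variable `salary_range` are assumptions made for illustration, and the values are copied from the data frame in the example.

```python
# Average salary per district, copied from the example.
salaries = {
    "District1": 87000, "District2": 91000, "District3": 66500,
    "District4": 98500, "District5": 96500, "District6": 97550,
    "District7": 97900, "District8": 97990,
}

highest = max(salaries.values())   # 98500 (District4)
lowest = min(salaries.values())    # 66500 (District3)
salary_range = highest - lowest
print(salary_range)                # 32000
```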
A measure of how spread out the observed values are from the mean.
\[s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\]
\(s^2\) = sample variance
\(x_i\) = raw value
\(\bar{x}\) = mean of all raw scores
\(n\) = number of observations
Find the \(\bar{x}\)
Subtract the mean from each score \((x_i - \bar{x})\)
Square the deviation score \((x_i - \bar{x})^2\)
Add the squared deviations \({\sum(x_i - \bar{x})^2}\)
Divide by \({n-1}\)
\[s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\]
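The five steps translate almost line for line into code. The sketch below is in Python; the function name `sample_variance` and the toy scores `[2, 4, 6, 8]` are hypothetical and used only to make the steps concrete.

```python
import statistics

def sample_variance(scores):
    """Sample variance following the five steps, dividing by n - 1."""
    n = len(scores)
    mean = sum(scores) / n                      # 1. find the mean
    deviations = [x - mean for x in scores]     # 2. subtract the mean from each score
    squared = [d ** 2 for d in deviations]      # 3. square each deviation score
    total = sum(squared)                        # 4. add the squared deviations
    return total / (n - 1)                      # 5. divide by n - 1

toy_scores = [2, 4, 6, 8]                       # hypothetical data
result = sample_variance(toy_scores)
print(result)                                   # 20 / 3, about 6.67

# The standard library's statistics.variance also divides by n - 1.
print(statistics.variance(toy_scores))
```

For the toy scores the mean is 5, the deviations are -3, -1, 1, 3, the squared deviations sum to 20, and 20 / 3 is about 6.67.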
Suppose we draw n independent observations from a population (N), with an unknown population mean \((\mu)\) and unknown population variance \((\sigma^2)\).
Ideally we would estimate \((\sigma^2)\) with the average squared distance from the true mean,
\[s^2 = \frac{\sum(x_i - \mu)^2}{n}\]
But we can’t! 😢
Because we don’t know our \((\mu)\).
Since we don’t know \((\mu)\), we use our best estimate of it which is the sample mean \(\bar{x}\).
So let’s pull out the population parameter \((\mu)\) and plug in our sample statistic \((\bar{x})\).
\[s^2 = \frac{\sum(x_i - \bar{x})^2}{n}\]
But another small problem pops up!
Population and the true \({\mu}\)
\[(-5,0)*--*-*-*----(0,0)--\stackrel{\mu}{|}--*--*--*--- (5,0)\]
Sample and the \(\bar{x}\)
\[(-5,0)*--*-*-----(0,0)--\stackrel{\mu}{|}--------- (5,0)\]
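The sample mean can land below or above the true \({\mu}\). Because \(\bar{x}\) is computed from the sample itself, the squared deviations around \(\bar{x}\) are never larger in total than those around \({\mu}\), so dividing by n tends to underestimate the population variance.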
We modify \(s^2 = \frac{\sum(x_i - \bar{x})^2}{n}\) to \(s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\)
And this provides an unbiased estimator of the population variance when using a sample.
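The effect of the correction can be checked exactly on a tiny case. The sketch below, in Python, assumes a hypothetical toy population `[1, 2, 3]` and lists every possible sample of size 2 drawn with replacement; none of these numbers come from the lecture data.

```python
from itertools import product
from math import isclose

population = [1, 2, 3]                           # hypothetical toy population
mu = sum(population) / len(population)           # true mean = 2
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)  # true variance = 2/3

def sum_sq_dev(sample):
    """Sum of squared deviations around the sample's own mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)

n = 2
samples = list(product(population, repeat=n))    # all 9 ordered samples, with replacement

# Average each estimator over every possible sample.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)
```

Averaged over all nine samples, dividing by n gives 1/3, half of the true variance of 2/3, while dividing by n - 1 recovers 2/3.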
image: khan academy
When we are measuring a population, divide by n. (hint: There will be a \(\mu\) in the deviation score.)
When we are measuring a sample, divide by n-1. (hint: There will be a \(\bar{x}\) in the deviation score.)
District Salary
1 District1 832
2 District2 931
3 District3 1468
4 District4 1021
5 District5 1039
6 District6 1515
7 District7 1138
8 District8 620
\[s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\]
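A quick way to check a hand calculation for this example is Python's standard `statistics` module, whose `variance` function divides by n - 1. The list name `salaries` is an assumption made for illustration; the values are copied from the table.

```python
import statistics

# Salary column from the table, District1 through District8.
salaries = [832, 931, 1468, 1021, 1039, 1515, 1138, 620]

mean = statistics.mean(salaries)                      # 1070.5
sum_sq = sum((x - mean) ** 2 for x in salaries)       # 642878.0
s2 = statistics.variance(salaries)                    # sample variance, divides by n - 1

print(mean, sum_sq, round(s2, 2))
```

The squared deviations sum to 642,878, and dividing by 7 gives a sample variance of about 91,839.71.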
District Salary
1 District1 983
2 District2 993
3 District3 1047
4 District4 1002
5 District5 1004
6 District6 1051
7 District7 1014
8 District8 962
\[s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\]
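Worked by hand for this table, the calculation might look like the following (the intermediate values are computed here from the eight salaries listed and are not part of the original slides):

\[\bar{x} = \frac{983 + 993 + 1047 + 1002 + 1004 + 1051 + 1014 + 962}{8} = \frac{8056}{8} = 1007\]
\[\sum(x_i - \bar{x})^2 = 576 + 196 + 1600 + 25 + 9 + 1936 + 49 + 2025 = 6416\]
\[s^2 = \frac{6416}{8 - 1} \approx 916.57\]

The salaries here sit much closer to their mean than those in the first variance example, so the variance is far smaller.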
A measure of how far observed values typically are from the mean.
It’s simply the square root of our variance!
\[s=\sqrt{\frac{\sum (x_{i} - \bar{x})^2}{n-1}}\]
\(s\) = sample standard deviation
\(x_i\) = raw value
\(\bar{x}\) = mean of all raw scores
\(n\) = number of observations
In our variance equation, we squared each deviation score \((x_i - \bar{x})\) to get rid of our negative values.
\[s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\]
But by doing that we created an output in squared units, which is hard to interpret against our data.
\[s=\sqrt{\frac{\sum (x_{i} - \bar{x})^2}{n-1}}\]
So by taking the square root, we return the measure to the original units of the data, and thus the values make sense.
Take our calculation from variance example 1 and take its square root.
District Salary
1 District1 832
2 District2 931
3 District3 1468
4 District4 1021
5 District5 1039
6 District6 1515
7 District7 1138
8 District8 620
\[s=\sqrt{\frac{\sum (x_{i} - \bar{x})^2}{n-1}}\]
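The sample standard deviation for this table can be checked in Python by taking the square root of the sample variance. This is a sketch; the names `salaries`, `s2`, and `s` are chosen here for illustration, and `statistics.stdev` is the standard-library function that divides by n - 1.

```python
import math
import statistics

# Salary column from the table, District1 through District8.
salaries = [832, 931, 1468, 1021, 1039, 1515, 1138, 620]

s2 = statistics.variance(salaries)   # sample variance, about 91839.71
s = math.sqrt(s2)                    # sample standard deviation

print(round(s, 2))                   # 303.05
print(round(statistics.stdev(salaries), 2))  # same value from the library
```

A standard deviation of about 303 is in the same units as the salaries themselves, which is what makes it easier to interpret than the variance.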
For data that are approximately normally distributed, about 68% of the sample will fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
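These percentages come from the normal distribution and can be reproduced with Python's standard library. The sketch below assumes a standard normal curve (mean 0, standard deviation 1); the names `z` and `shares` are chosen for illustration.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Share of the distribution within k standard deviations of the mean.
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(share, 4))
```

This prints roughly 0.6827, 0.9545, and 0.9973. Data that are strongly skewed or heavy-tailed will not follow these percentages closely.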
image: medium